Stock Market Trend Forecast

Using Technical Analysis and Machine Learning

Stock
Technical Analysis
Quantitative Analysis
Machine Learning
Author

Hoang Son Lai

Data Description

Code
# Load libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import table
import os
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook_connected"
pio.templates.default = "plotly_white"
import pandas_ta as ta
import mplfinance as mpf
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import ipywidgets as widgets
from IPython.display import display, clear_output
# To convert to html, use quarto render ml_report.ipynb

# Load Data
try:
    df = pd.read_csv('report_data/stock_prices.csv')
    df['date'] = pd.to_datetime(df['date'])
    # Sort by Ticker and Date for accurate calculation
    df = df.sort_values(['ticker', 'date'])
except FileNotFoundError:
    print("Error: File 'report_data/stock_prices.csv' not found.")

This dataset provides a robust historical record of market performance for 21 unique tickers, spanning from December 7, 2020, to February 3, 2026.

Notably, the data exhibits exceptional quality and balance: every ticker contains exactly 1,295 trading days, ensuring a uniform time-series structure. With 0% missing values and no duplicate entries found, this clean dataset serves as a reliable foundation for technical analysis and algorithmic backtesting.

Code
def describe_market_data(df):
    # 1. Basic Structure
    print(f"▶ Total Rows:       {df.shape[0]:,}")
    print(f"▶ Total Columns:    {df.shape[1]}")
    print(f"▶ Column Name: {df.columns}")

    # 2. Temporal Analysis (Date Range)
    # Ensure date column is datetime
    if 'date' in df.columns:
        if not pd.api.types.is_datetime64_any_dtype(df['date']):
            try:
                df['date'] = pd.to_datetime(df['date'])
                print("✓ 'date' column converted to datetime format.")
            except Exception:
                print("⚠ Warning: Could not convert 'date' column.")
        
        start_date = df['date'].min().strftime('%Y-%m-%d')
        end_date = df['date'].max().strftime('%Y-%m-%d')
        duration = (df['date'].max() - df['date'].min()).days
        
        print("\nTIME PERIOD:")
        print(f"▶ Start Date:       {start_date}")
        print(f"▶ End Date:         {end_date}")
        print(f"▶ Duration:         {duration} days")
    else:
        print("\n⚠ 'date' column not found!")

    # 3. Ticker Analysis
    if 'ticker' in df.columns:
        unique_tickers = df['ticker'].nunique()
        tickers_list = df['ticker'].unique()
        
        print("\nTICKER STATISTICS:")
        print(f"▶ Unique Tickers:   {unique_tickers}")
        
        # Check for data balance (Top 5 and Bottom 5 tickers by row count)
        ticker_counts = df['ticker'].value_counts()
        print(f"▶ Most Data:        {ticker_counts.index[0]} ({ticker_counts.iloc[0]} rows)")
        print(f"▶ Least Data:       {ticker_counts.index[-1]} ({ticker_counts.iloc[-1]} rows)")
        
        # Example list
        if unique_tickers > 10:
            print(f"▶ Examples:         {', '.join(tickers_list[:5])} ... {', '.join(tickers_list[-5:])}")
        else:
            print(f"▶ List:             {', '.join(tickers_list)}")
    else:
        print("\n⚠ 'ticker' column not found!")

    # 4. Data Quality Check
    print("\nDATA QUALITY CHECK:")
    
    # Missing Values
    missing_data = df.isnull().sum()
    total_cells = np.prod(df.shape)
    total_missing = missing_data.sum()
    
    print(f"▶ Missing Values:   {total_missing:,} cells ({total_missing/total_cells:.2%})")
    
    if total_missing > 0:
        print("  - Columns with most missing values:")
        print(missing_data[missing_data > 0].sort_values(ascending=False).head(5).to_string(header=False))

    # Duplicates Check (assuming Ticker + Date should be unique)
    if 'ticker' in df.columns and 'date' in df.columns:
        duplicates = df.duplicated(subset=['ticker', 'date']).sum()
        if duplicates > 0:
            print(f"⚠ CRITICAL: Found {duplicates} duplicate rows based on Ticker + Date!")
        else:
            print("✓ Integrity Check:  No duplicate (Ticker + Date) pairs found.")
        
describe_market_data(df)
▶ Total Rows:       27,195
▶ Total Columns:    9
▶ Column Name: Index(['id', 'ticker', 'date', 'open', 'high', 'low', 'close', 'adj_close',
       'volume'],
      dtype='str')

TIME PERIOD:
▶ Start Date:       2020-12-07
▶ End Date:         2026-02-03
▶ Duration:         1884 days

TICKER STATISTICS:
▶ Unique Tickers:   21
▶ Most Data:        AAPL (1295 rows)
▶ Least Data:       WMT (1295 rows)
▶ Examples:         AAPL, ADBE, AMZN, BAC, DIS ... PYPL, TSLA, UNH, V, WMT

DATA QUALITY CHECK:
▶ Missing Values:   0 cells (0.00%)
✓ Integrity Check:  No duplicate (Ticker + Date) pairs found.

Part 1. Technical Analysis

1.1 Objective

The purpose of this section is to identify medium- to long-term investment opportunities by analyzing price trends, momentum, and volatility. I evaluate three core dimensions for each stock:

  • Trend Strength – Is the stock in a sustained uptrend?

  • Momentum – Is buying pressure dominant?

  • Volatility – Is the stock stable enough for long-term holding?

To ensure objectivity and consistency, I apply a transparent, rule-based technical scoring system (maximum 7 points). Higher scores indicate stronger technical setups suitable for core portfolio holdings.

1.2 Indicator Calculation

I calculate the following key technical indicators on a per-ticker basis using groupby operations. This ensures each stock's indicators are computed solely from its own price history, so values from one security never leak into the calculations for another; and because rolling and EMA-based windows use only past observations, the indicators are free of look-ahead bias.

Here is a detailed explanation of each indicator:

SMA50 & SMA200 (Simple Moving Averages):

The 50-day and 200-day simple moving averages represent short- and long-term price trends, respectively. The SMA200 is particularly important for identifying the primary (long-term) trend. A price trading above the SMA200 generally indicates a bullish long-term environment.

RSI (Relative Strength Index, 14-period):

The RSI measures the speed and magnitude of recent price changes to evaluate overbought or oversold conditions. Values above 70 suggest overbought conditions (potential pullback), while values below 30 indicate oversold conditions (potential rebound). In this framework, I treat the 40–70 range as “healthy momentum” for sustainable uptrends.
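The arithmetic behind the `ta.rsi` call used later can be sketched directly in pandas. This is a minimal illustration of Wilder's smoothing; the helper name `rsi_wilder` and the toy price series are mine, and values may differ marginally from pandas_ta's implementation:

```python
import pandas as pd

def rsi_wilder(close: pd.Series, length: int = 14) -> pd.Series:
    """Illustrative RSI: smoothed gains relative to smoothed losses."""
    delta = close.diff()
    gain = delta.clip(lower=0)      # positive moves only
    loss = -delta.clip(upper=0)     # negative moves, as positive numbers
    # Wilder's smoothing is an EMA with alpha = 1/length
    avg_gain = gain.ewm(alpha=1 / length, min_periods=length).mean()
    avg_loss = loss.ewm(alpha=1 / length, min_periods=length).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Toy series: two up days followed by one down day, repeated
prices = pd.Series([100 + i + (i % 3) for i in range(30)], dtype=float)
rsi = rsi_wilder(prices)
```

Because the toy series gains on two days out of three, the final RSI lands well above the neutral 50 level.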

MACD (Moving Average Convergence Divergence):

The MACD consists of three components:

  • MACD line: The difference between the 12-day and 26-day EMAs.

  • Signal line: A 9-day EMA of the MACD line.

  • Histogram: The difference between the MACD line and the signal line.

A positive and rising histogram indicates strengthening bullish momentum.
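The three components above follow a standard EMA construction; here is a minimal sketch (the function name `macd_sketch` and the toy series are illustrative, not the pandas_ta internals):

```python
import pandas as pd

def macd_sketch(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    # MACD line: difference between the fast and slow EMAs
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    # Signal line: EMA of the MACD line; histogram: MACD minus signal
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    histogram = macd_line - signal_line
    return macd_line, signal_line, histogram

close = pd.Series(range(100, 160), dtype=float)   # steadily rising toy price
macd_line, signal_line, histogram = macd_sketch(close)
```

For a steadily rising series the fast EMA sits above the slow EMA, so the MACD line ends positive, consistent with the bullish-momentum reading described above.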

ATR (Average True Range, 14-period):

ATR quantifies market volatility by measuring the average range between high and low prices over a period. Lower ATR values suggest more stable price movement, which is preferable for long-term investors who want to avoid excessive risk.
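The true range behind `ta.atr` takes the widest of three spans (today's high-low range and the two gaps from yesterday's close) before Wilder smoothing. A minimal sketch, where the helper `atr_sketch` and the synthetic OHLC data are mine:

```python
import pandas as pd

def atr_sketch(high: pd.Series, low: pd.Series, close: pd.Series,
               length: int = 14) -> pd.Series:
    prev_close = close.shift(1)
    # True range: widest of today's range and the gaps vs. yesterday's close
    tr = pd.concat([
        high - low,
        (high - prev_close).abs(),
        (low - prev_close).abs(),
    ], axis=1).max(axis=1)
    # Wilder's smoothing (EMA with alpha = 1/length)
    return tr.ewm(alpha=1 / length, min_periods=length).mean()

# Synthetic data: price drifts up $1/day inside a constant $2 daily range
high = pd.Series([101.0 + i for i in range(30)])
low = high - 2.0
close = high - 1.0
atr = atr_sketch(high, low, close)
```

Since every day's true range is exactly $2 in this synthetic example, the smoothed ATR converges to 2.0.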

Code
# Work on clean copy
df_tech = df.copy()
df_tech = df_tech.sort_values(['ticker', 'date']).reset_index(drop=True)

# ----- Moving Averages -----

df_tech['SMA50'] = (
    df_tech.groupby('ticker')['close']
    .transform(lambda x: x.rolling(50, min_periods=50).mean())
)

df_tech['SMA200'] = (
    df_tech.groupby('ticker')['close']
    .transform(lambda x: x.rolling(200, min_periods=200).mean())
)

# ----- RSI -----
df_tech['RSI'] = (
    df_tech.groupby('ticker')['close']
    .transform(lambda x: ta.rsi(x, length=14))
)

# ----- MACD -----
macd_df = (
    df_tech.groupby('ticker')['close']
    .apply(lambda x: ta.macd(x))
)

# groupby().apply returns a MultiIndex (ticker, row) → drop the ticker level
macd_df = macd_df.reset_index(level=0, drop=True)

# Row order matches df_tech because both are sorted by ticker and date
df_tech[['MACD', 'MACD_signal', 'MACD_hist']] = macd_df.values

# ----- ATR -----
df_tech['ATR'] = (
    df_tech.groupby('ticker')
    .apply(lambda x: ta.atr(x['high'], x['low'], x['close'], length=14))
    .reset_index(level=0, drop=True)
)

# Final sanity check
print(df_tech.columns)
Index(['id', 'ticker', 'date', 'open', 'high', 'low', 'close', 'adj_close',
       'volume', 'SMA50', 'SMA200', 'RSI', 'MACD', 'MACD_signal', 'MACD_hist',
       'ATR'],
      dtype='str')

1.3 Technical Scoring Framework

I constructed a rule-based scoring model to rank stocks objectively.

Scoring Rules (Maximum = 7 points):

  • Price > SMA200 → +2 points (Long-term uptrend)

  • SMA50 > SMA200 → +2 points (Bullish structure / Golden Cross)

  • RSI between 40 and 70 → +1 point (Healthy momentum, not overbought or oversold)

  • MACD Histogram > 0 → +1 point (Positive momentum)

  • ATR below cross-sectional median → +1 point (Controlled volatility, suitable for long-term holding)

Higher scores reflect stronger technical positioning for medium- to long-term investment.

Code
# ============================================
# TECHNICAL SCORING
# ============================================

latest_data = (
    df_tech
    .dropna(subset=['SMA200', 'SMA50', 'RSI', 'MACD_hist', 'ATR'])
    .sort_values(['ticker', 'date'])
    .groupby('ticker', as_index=False)
    .last()
    .copy()
)

atr_median = latest_data['ATR'].median()
latest_data['Technical_Score'] = 0

# Rules
latest_data.loc[latest_data['close'] > latest_data['SMA200'], 'Technical_Score'] += 2
latest_data.loc[latest_data['SMA50'] > latest_data['SMA200'], 'Technical_Score'] += 2
latest_data.loc[(latest_data['RSI'] >= 40) & (latest_data['RSI'] <= 70), 'Technical_Score'] += 1
latest_data.loc[latest_data['MACD_hist'] > 0, 'Technical_Score'] += 1
latest_data.loc[latest_data['ATR'] <= atr_median, 'Technical_Score'] += 1

# Final ranking table with styling
technical_ranking = (
    latest_data
    .sort_values('Technical_Score', ascending=False)
    [['ticker', 'close', 'SMA50', 'SMA200', 'RSI', 'MACD_hist', 'ATR', 'Technical_Score']]
    .reset_index(drop=True)
)

# Styled table
styled_table = technical_ranking.style\
    .format({
        'close': '{:.2f}',
        'SMA50': '{:.2f}',
        'SMA200': '{:.2f}',
        'RSI': '{:.2f}',
        'MACD_hist': '{:.4f}',
        'ATR': '{:.4f}',
        'Technical_Score': '{:.0f}'
    })\
    .set_table_styles([
        {'selector': 'th',
         'props': [('background-color', 'steelblue'),
                   ('color', 'white'),
                   ('font-weight', 'bold'),
                   ('text-align', 'center')]}
    ])\
    .background_gradient(subset=['Technical_Score'], cmap='RdYlGn', vmin=0, vmax=7)\
    .hide(axis='index')

styled_table
ticker close SMA50 SMA200 RSI MACD_hist ATR Technical_Score
AMZN 238.62 233.27 222.43 51.20 2.0392 6.0538 7
NVDA 180.34 183.78 168.49 42.39 0.8151 5.5977 7
AAPL 269.48 268.38 237.10 61.64 -3.4410 5.7653 6
GOOGL 339.71 320.15 236.20 64.11 6.3572 8.3686 6
BAC 54.45 53.99 48.90 55.86 -0.4719 1.0486 6
WMT 127.71 114.71 103.25 76.45 1.7854 2.6163 6
JNJ 233.10 211.16 178.97 82.03 4.8995 3.8927 6
KO 76.89 71.06 69.28 78.42 0.8591 1.1461 6
TSLA 421.96 444.33 378.96 42.56 -5.2252 15.4590 5
PG 155.32 145.44 152.82 69.74 1.3932 2.4784 5
JPM 314.85 313.15 292.76 54.39 -3.2201 6.4258 5
META 691.70 652.10 682.94 57.50 6.6052 21.5360 4
HD 381.10 359.42 371.62 59.69 6.1622 8.5331 4
UNH 284.18 328.36 324.68 31.48 -5.5849 11.6865 2
DIS 104.22 110.07 112.09 34.15 -0.1644 2.6815 1
PYPL 41.70 58.55 66.88 12.84 -1.6890 2.0007 1
MA 550.72 553.05 562.91 52.06 -6.9823 11.5092 1
V 328.93 337.68 344.12 45.09 -4.0178 6.9771 1
NFLX 79.94 93.60 112.67 21.28 -3.2426 2.3706 1
ADBE 271.93 326.49 355.65 23.30 -11.3831 9.4392 0
MSFT 411.21 473.26 485.25 27.38 -8.3554 13.1559 0

Key Findings:

  • AMZN and NVDA show the strongest technical setups (score 7/7).

  • Stocks scoring 6–7 are considered high-conviction candidates.

  • Several names (e.g., PYPL, ADBE, MSFT) score 0–1, indicating weak trends or excessive volatility.

A stock is considered technically strong and suitable for medium- to long-term holding if it meets most of the following: sustained price above SMA200, Golden Cross, healthy RSI, positive MACD histogram, and moderate ATR.

1.4 Interactive Technical Charts

The following interactive dashboards provide a comprehensive visualization of each stock’s technical condition:

  • Price Action – Historical closing prices to observe overall movement and structure

  • Trend Indicators – 50-day and 200-day Simple Moving Averages (SMA50 & SMA200) to evaluate long-term trend direction

  • Momentum (RSI) – Relative Strength Index to assess overbought and oversold conditions

  • MACD & Histogram – Momentum acceleration and trend confirmation signals

Users can dynamically switch between tickers using the dropdown menu located at the top-left of each chart.

Code
df_plot = df_tech.copy() 
df_plot['date'] = pd.to_datetime(df_plot['date']) 
df_plot = df_plot.sort_values(['ticker', 'date']) 
tickers = df_plot['ticker'].unique()

# Get full date range for padding
min_date = df_plot['date'].min()
max_date = df_plot['date'].max()
date_padding = pd.Timedelta(days=10)

fig = make_subplots(
    rows=3, cols=1,
    shared_xaxes=True,
    vertical_spacing=0.08,
    row_heights=[0.5, 0.25, 0.25],
    subplot_titles=("Price & Moving Averages", "RSI", "MACD")
)

TRACES_PER_TICKER = 7
visibility_matrix = []

for i, ticker in enumerate(tickers):

    data = df_plot[df_plot['ticker'] == ticker]
    visible = [False] * (len(tickers) * TRACES_PER_TICKER)
    base = i * TRACES_PER_TICKER
    for j in range(TRACES_PER_TICKER):
        visible[base + j] = True
    visibility_matrix.append(visible)

    # ================= PRICE =================
    fig.add_trace(go.Scatter(
        x=data['date'], y=data['close'],
        mode='lines',
        name='Close',
        legendgroup=f'price_{ticker}',
        legend='legend1',
        visible=(i==0),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>Close: %{y:.2f}<extra></extra>"
    ), row=1, col=1)

    fig.add_trace(go.Scatter(
        x=data['date'], y=data['SMA50'],
        mode='lines',
        name='SMA50',
        legendgroup=f'price_{ticker}',
        legend='legend1',
        visible=(i==0),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>SMA50: %{y:.2f}<extra></extra>"
    ), row=1, col=1)

    fig.add_trace(go.Scatter(
        x=data['date'], y=data['SMA200'],
        mode='lines',
        name='SMA200',
        legendgroup=f'price_{ticker}',
        legend='legend1',
        visible=(i==0),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>SMA200: %{y:.2f}<extra></extra>"
    ), row=1, col=1)

    # ================= RSI =================
    fig.add_trace(go.Scatter(
        x=data['date'], y=data['RSI'],
        mode='lines',
        name='RSI',
        legendgroup=f'rsi_{ticker}',
        legend='legend2',
        visible=(i==0),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>RSI: %{y:.2f}<extra></extra>"
    ), row=2, col=1)

    # ================= MACD =================
    fig.add_trace(go.Scatter(
        x=data['date'], y=data['MACD'],
        mode='lines',
        name='MACD',
        legendgroup=f'macd_{ticker}',
        legend='legend3',
        visible=(i==0),
        line=dict(color='blue'),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>MACD: %{y:.4f}<extra></extra>"
    ), row=3, col=1)

    fig.add_trace(go.Scatter(
        x=data['date'], y=data['MACD_signal'],
        mode='lines',
        name='Signal',
        legendgroup=f'macd_{ticker}',
        legend='legend3',
        visible=(i==0),
        line=dict(color='orange'),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>Signal: %{y:.4f}<extra></extra>"
    ), row=3, col=1)

    fig.add_trace(go.Scatter(
        x=data['date'], y=data['MACD_hist'],
        mode='lines',
        name='Histogram',
        legendgroup=f'macd_{ticker}',
        legend='legend3',
        visible=(i==0),
        line=dict(color='red', dash='dot'),
        hovertemplate="Date: %{x|%Y-%m-%d}<br>Histogram: %{y:.4f}<extra></extra>"
    ), row=3, col=1)

# Dropdown
buttons = []
for i, ticker in enumerate(tickers):
    buttons.append(dict(
        label=ticker,
        method="update",
        args=[
            {"visible": visibility_matrix[i]},
            {"title": f"{ticker} - Technical Dashboard"}
        ]
    ))

# Reference lines
fig.add_hline(y=70, line_dash="dash", row=2, col=1)
fig.add_hline(y=40, line_dash="dash", row=2, col=1)
fig.add_hline(y=0, line_dash="dash", row=3, col=1)

fig.update_layout(
    height=1000,
    hovermode="x unified",

    legend1=dict(x=1.02, y=0.92, xanchor="left"),
    legend2=dict(x=1.02, y=0.55, xanchor="left"),
    legend3=dict(x=1.02, y=0.18, xanchor="left"),

    updatemenus=[dict(
        buttons=buttons,
        direction="down",
        x=0.01,
        y=1.08,
        xanchor="left",
        yanchor="top"
    )],

    margin=dict(l=60, r=160, t=80, b=60)
)

# Add padding so first & last dates are visible
fig.update_xaxes(
    type="date",
    range=[min_date - date_padding, max_date + date_padding]
)

fig.show()

Part 2. Quantitative Analysis

2.1 Objective

The purpose of this section is to evaluate each stock from a quantitative investment perspective by measuring three core aspects:

  • Historical returns (how much value was created)

  • Risk exposure (how much volatility and drawdown investors endured)

  • Risk-adjusted performance (how efficiently returns were generated relative to risk)

Unlike technical analysis, which focuses on timing and momentum, quantitative metrics assess structural, long-term performance. The goal is to identify stocks that not only delivered strong returns but did so with disciplined risk management.

2.2 Historical Returns

Return measures how much value an investment generates over time. I evaluate three complementary metrics:

  • Daily Return: Percentage change in adjusted close price from one day to the next.

  • Total Return: Cumulative growth from the first to the last trading day.

  • CAGR (Compound Annual Growth Rate): The smoothed annual return that equates the starting value to the ending value over the full period. CAGR is especially valuable for medium- to long-term investors because it accounts for compounding and removes the distorting effect of volatility.
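As a quick numeric illustration of the CAGR definition (the dollar figures are made up): an investment that doubles over five years compounds at roughly 14.9% per year, not 20%.

```python
# Hypothetical example: $100 grows to $200 over 5 years
start_value, end_value, years = 100.0, 200.0, 5.0
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"CAGR: {cagr:.2%}")   # ≈ 14.87% per year, not 100% / 5 = 20%
```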

Code
# Ensure proper datetime format
df = df_tech.copy()
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['ticker', 'date'])

# -----------------------------
# Daily Returns
# -----------------------------
df['daily_return'] = df.groupby('ticker')['adj_close'].pct_change()

# -----------------------------
# Cumulative Returns
# -----------------------------
df['cum_return'] = (
    df.groupby('ticker')['daily_return']
    .transform(lambda x: (1 + x).cumprod())
)

# Get full date range for padding
min_date = df['date'].min()
max_date = df['date'].max()
date_padding = pd.Timedelta(days=10)

# Get latest cumulative return for each ticker
latest_perf = (
    df.sort_values('date')
      .groupby('ticker')
      .tail(1)[['ticker', 'cum_return']]
      .sort_values('cum_return', ascending=False)
)

ordered_tickers = latest_perf['ticker'].tolist()

df['ticker'] = pd.Categorical(
    df['ticker'],
    categories=ordered_tickers,
    ordered=True
)

df = df.sort_values(['ticker', 'date'])

# Build figure
fig_cum = px.line(
    df,
    x='date',
    y='cum_return',
    color='ticker',
    category_orders={"ticker": ordered_tickers},
    title='Cumulative Return Over Time'
)

# Force daily format + unified hover
fig_cum.update_traces(
    hovertemplate=
    "<b>%{fullData.name}</b><br>" +
    "Date: %{x|%Y-%m-%d}<br>" +
    "Cumulative Return: %{y:.2f}<extra></extra>"
)

fig_cum.update_layout(
    height=500,
    yaxis_title="Cumulative Return (Growth of $1)",
    xaxis_title="Date",
    legend_title="Ticker",
    xaxis=dict(
        type="date",
        hoverformat="%Y-%m-%d"
    )
)

# Add padding so first & last dates are visible
fig_cum.update_xaxes(
    type="date",
    range=[min_date - date_padding, max_date + date_padding]
)

fig_cum.show()
Code
# -----------------------------
# Total Return
# -----------------------------
total_return = (
    df.groupby('ticker')['adj_close']
    .agg(lambda x: x.iloc[-1] / x.iloc[0] - 1)
    .reset_index(name='Total Return')
)

# -----------------------------
# CAGR
# -----------------------------
years = (df['date'].max() - df['date'].min()).days / 365

cagr = (
    df.groupby('ticker')['adj_close']
    .agg(lambda x: (x.iloc[-1] / x.iloc[0]) ** (1 / years) - 1)
    .reset_index(name='CAGR')
)

# -----------------------------
# Mean Daily Return
# -----------------------------

mean_daily = (
    df.groupby('ticker')['daily_return']
    .mean()
    .reset_index(name='Mean Daily Return')
)

# -----------------------------
# Merge All Return Metrics
# -----------------------------

return_table = (
    total_return
    .merge(mean_daily, on='ticker')
    .merge(cagr, on='ticker')
)

# -----------------------------
# Display Table (no index, styled header)
# -----------------------------

return_table_styled = (
    return_table
    .sort_values('CAGR', ascending=False)
    .style
    .format({
        'Total Return': '{:.2%}',
        'Mean Daily Return': '{:.4%}',
        'CAGR': '{:.2%}'
    })
    .set_table_styles([
        {'selector': 'th', 
         'props': [('background-color', 'steelblue'), 
                   ('color', 'white'), 
                   ('font-weight', 'bold')]}
    ])
    .hide(axis='index')   # Hide the row index numbers
)

return_table_styled
ticker Total Return Mean Daily Return CAGR
NVDA 1229.10% 0.2524% 65.07%
GOOGL 276.50% 0.1213% 29.28%
JPM 194.13% 0.0951% 23.25%
WMT 177.32% 0.0875% 21.85%
META 143.70% 0.1063% 18.84%
AAPL 123.70% 0.0775% 16.88%
BAC 111.21% 0.0722% 15.59%
MSFT 99.99% 0.0670% 14.37%
TSLA 97.25% 0.1244% 14.07%
JNJ 80.36% 0.0512% 12.10%
KO 68.66% 0.0455% 10.66%
MA 66.16% 0.0508% 10.34%
HD 63.90% 0.0492% 10.05%
V 60.39% 0.0467% 9.59%
NFLX 54.99% 0.0718% 8.86%
AMZN 51.12% 0.0560% 8.33%
PG 27.85% 0.0249% 4.87%
UNH -11.96% 0.0096% -2.44%
DIS -31.05% -0.0112% -6.95%
ADBE -44.76% -0.0210% -10.86%
PYPL -80.82% -0.0912% -27.38%

Key Findings:

  • NVDA delivered exceptional performance with a CAGR of 65.07%.

  • Strong performers also include GOOGL, JPM, and WMT.

  • Several stocks (PYPL, ADBE, DIS) posted negative returns over the period.

2.3 Risk Analysis

I evaluate risk using three complementary metrics:

  • Annualized Volatility: Standard deviation of daily returns, scaled by √252 (trading days per year). It captures total price fluctuation.

  • Maximum Drawdown: The largest peak-to-trough decline. It answers: “What was the worst loss an investor would have suffered if they bought at the peak?”

  • Downside Deviation: Volatility of only negative returns. It focuses purely on harmful downside risk, making it more relevant for long-term investors than total volatility.

Code
# -----------------------------
# Annualized Volatility
# -----------------------------
volatility = (
    df.groupby('ticker')['daily_return']
    .std()
    .reset_index(name='Volatility')
)

volatility['Volatility'] *= np.sqrt(252)

# -----------------------------
# Max Drawdown
# -----------------------------
def max_drawdown(series):
    cumulative = (1 + series).cumprod()
    peak = cumulative.cummax()
    drawdown = cumulative / peak - 1
    return drawdown.min()

max_dd = (
    df.groupby('ticker')['daily_return']
    .apply(max_drawdown)
    .reset_index(name='Max Drawdown')
)

# -----------------------------
# Downside Deviation
# -----------------------------
def downside_dev(series):
    negative_returns = series[series < 0]
    return negative_returns.std() * np.sqrt(252)

downside = (
    df.groupby('ticker')['daily_return']
    .apply(downside_dev)
    .reset_index(name='Downside Deviation')
)

# -----------------------------
# Merge All Risk Metrics
# -----------------------------
risk_table = volatility.merge(max_dd, on='ticker') \
                       .merge(downside, on='ticker')

# Display table

risk_table_styled = (
    risk_table
    .sort_values('Volatility', ascending=False)
    .style
    .format({
        'Volatility': '{:.2%}',
        'Max Drawdown': '{:.2%}',
        'Downside Deviation': '{:.2%}'
    })
    .set_table_styles([
        {'selector': 'th', 
         'props': [('background-color', 'steelblue'), 
                   ('color', 'white'), 
                   ('font-weight', 'bold')]}
    ])
    .hide(axis='index')
)

risk_table_styled
ticker Volatility Max Drawdown Downside Deviation
TSLA 60.39% -73.63% 38.64%
NVDA 51.60% -66.34% 32.15%
META 43.25% -76.74% 32.76%
NFLX 42.90% -75.95% 34.69%
PYPL 42.30% -86.45% 33.41%
ADBE 35.13% -60.50% 28.77%
AMZN 34.85% -56.15% 24.05%
GOOGL 30.82% -44.32% 20.87%
UNH 30.72% -61.39% 28.59%
DIS 29.76% -60.72% 21.20%
AAPL 27.83% -33.36% 18.78%
BAC 26.95% -46.64% 17.94%
MSFT 26.02% -37.15% 18.10%
JPM 24.28% -38.77% 17.12%
MA 24.17% -28.25% 17.36%
HD 23.56% -34.73% 16.34%
V 22.68% -28.60% 16.28%
WMT 20.84% -25.74% 16.21%
PG 17.26% -23.77% 12.68%
JNJ 16.78% -18.41% 11.35%
KO 15.94% -17.27% 11.20%
Code
risk_return = return_table.merge(risk_table, on='ticker')

fig_scatter = px.scatter(
    risk_return,
    x='Volatility',
    y='CAGR',
    text='ticker',
    title='Risk vs Return (CAGR vs Volatility)'
)

fig_scatter.update_traces(
    textposition='top center',
    marker=dict(size=12),
    hovertemplate=
        "<b>%{text}</b><br>" +
        "CAGR: %{y:.2%}<br>" +
        "Volatility: %{x:.2%}<br>" +
        "Max Drawdown: %{customdata[0]:.2%}<br>" +
        "<extra></extra>",
    customdata=risk_return[['Max Drawdown']]
)

fig_scatter.update_layout(
    height=500,
    hovermode="closest"
)

fig_scatter.show()

Key observations:

  • NVDA stands out in the top-right quadrant: extremely high return but also the highest volatility. This reflects its aggressive growth profile - suitable for investors with high risk tolerance.

  • WMT, JPM, and GOOGL occupy a favorable middle zone: strong CAGR with relatively moderate volatility, indicating efficient return generation.

  • PYPL and TSLA show high volatility combined with poor or negative returns - the least attractive risk-return profile.

  • Defensive names such as JNJ and KO appear on the left side (low volatility) with modest but stable returns, ideal for conservative portfolios.

2.4 Risk-Adjusted Performance

To evaluate how efficiently returns were generated per unit of risk, I compute three widely respected ratios:

Sharpe Ratio = (CAGR – Risk-free rate) / Annualized Volatility

This is the most common risk-adjusted metric. It shows excess return (above a risk-free benchmark, here 2%) per unit of total volatility. A Sharpe Ratio > 1.0 is generally considered excellent; values above 0.5 are acceptable for equity strategies.

Sortino Ratio = (CAGR – Risk-free rate) / Downside Deviation

An improvement over the Sharpe Ratio because it only penalizes harmful (downside) volatility. It is particularly useful for equity investors who are more concerned about large losses than upward swings. Higher Sortino values indicate stronger protection against downside risk.

Calmar Ratio = CAGR / |Maximum Drawdown|

This ratio focuses on drawdown risk, which is highly relevant for long-term holding periods. It measures how much annual return is earned for every 1% of the worst historical loss. A Calmar Ratio > 0.5 is often viewed as attractive for medium- to long-term strategies.

Code
risk_free_rate = 0.02

metrics = risk_return.copy()

# Sharpe
metrics['Sharpe'] = (
    (metrics['CAGR'] - risk_free_rate) / metrics['Volatility']
)

# Sortino
metrics['Sortino'] = (
    (metrics['CAGR'] - risk_free_rate) / metrics['Downside Deviation']
)

# Calmar
metrics['Calmar'] = (
    metrics['CAGR'] / abs(metrics['Max Drawdown'])
)

# Keep the columns needed for ranking
metrics = metrics[
    ['ticker', 'CAGR', 'Volatility', 'Max Drawdown',
     'Sharpe', 'Sortino', 'Calmar']
]

# Display table
risk_adjusted_styled = (
    metrics[['ticker', 'Sharpe', 'Sortino', 'Calmar']]
    .sort_values('Sharpe', ascending=False)
    .style
    .format({
        'Sharpe': '{:.3f}',
        'Sortino': '{:.3f}',
        'Calmar': '{:.3f}'
    })
    .set_table_styles([
        {'selector': 'th', 
         'props': [('background-color', 'steelblue'), 
                   ('color', 'white'), 
                   ('font-weight', 'bold')]}
    ])
    .hide(axis='index')
)

risk_adjusted_styled
ticker Sharpe Sortino Calmar
NVDA 1.222 1.962 0.981
WMT 0.952 1.224 0.849
GOOGL 0.885 1.307 0.661
JPM 0.875 1.241 0.600
JNJ 0.602 0.890 0.657
KO 0.543 0.773 0.617
AAPL 0.535 0.792 0.506
BAC 0.504 0.757 0.334
MSFT 0.475 0.683 0.387
META 0.389 0.514 0.245
MA 0.345 0.480 0.366
HD 0.341 0.492 0.289
V 0.334 0.466 0.335
TSLA 0.200 0.312 0.191
AMZN 0.182 0.263 0.148
PG 0.167 0.227 0.205
NFLX 0.160 0.198 0.117
UNH -0.144 -0.155 -0.040
DIS -0.301 -0.422 -0.114
ADBE -0.366 -0.447 -0.180
PYPL -0.695 -0.879 -0.317

Key findings

  • NVDA leads with the highest Sharpe (1.22) and Sortino (1.96), showing it delivered exceptional excess return even after accounting for its high volatility.

  • WMT and JPM also score very well, offering strong returns with better risk control.

  • Negative ratios (UNH, DIS, ADBE, PYPL) indicate that these stocks failed to compensate investors adequately for the risk endured.

2.5 Quantitative Scoring

To create a single, objective ranking that combines return, risk, and efficiency, I developed a Quantitative Score ranging from 0 to 10.

Step-by-step methodology:

  1. Min-Max Normalization:

Each raw metric is scaled to a 0–1 range so they become comparable:

  • For “higher is better” metrics (CAGR, Sharpe): score = (value - min) / (max - min)

  • For “lower is better” risk metrics (Volatility, Max Drawdown): score = 1 - (value - min) / (max - min)

This removes differences in scale and units.

  2. Weighted Aggregation:

The normalized scores are combined using the following weights, chosen to reflect long-term investor priorities:

  • 40% CAGR → Emphasizes actual wealth creation (growth is the ultimate goal).

  • 30% Sharpe Ratio → Rewards efficient return per unit of total risk.

  • 20% Volatility control → Penalizes excessive price swings.

  • 10% Drawdown control → Gives modest weight to capital preservation (the worst historical loss).

These weights prioritize sustainable compounding while still penalizing high risk.

  3. Final Scaling:

The weighted sum is multiplied by 10 to produce a clean 0–10 score.

  • Score > 7.0: Excellent risk-return profile (core holdings).

  • Score 6.0–7.0: Solid, well-balanced candidates.

  • Score < 4.0: Weak or high-risk underperformers.

Code
# Step 1 – Normalize Metrics (Min–Max)
score_df = metrics.copy()

def minmax(series):
    return (series - series.min()) / (series.max() - series.min())

score_df['Return_score'] = minmax(score_df['CAGR'])
score_df['Sharpe_score'] = minmax(score_df['Sharpe'])
score_df['Vol_score'] = 1 - minmax(score_df['Volatility'])
score_df['DD_score'] = 1 - minmax(abs(score_df['Max Drawdown']))

# Step 2 – Weighted Quant Score
score_df['Quant Score'] = (
    0.4 * score_df['Return_score'] +
    0.3 * score_df['Sharpe_score'] +
    0.2 * score_df['Vol_score'] +
    0.1 * score_df['DD_score']
)

score_df['Quant Score'] *= 10

score_df = score_df.sort_values('Quant Score', ascending=False)


# Step 3 - Quant Score Ranking 
# Round values for presentation
display_table = score_df.copy()

display_table = display_table[[
    'ticker',
    'CAGR',
    'Volatility',
    'Max Drawdown',
    'Sharpe',
    'Return_score',
    'Sharpe_score',
    'Vol_score',
    'DD_score',
    'Quant Score'
]]

display_table = display_table.round(3)

display_table = display_table.sort_values('Quant Score', ascending=False)

def quant_score_style(val):
    if val > 7.0:
        return 'background-color: #006400; color: white; font-weight: bold'   # Dark green
    elif val >= 6:
        return 'background-color: #90EE90; color: black; font-weight: bold'   # Light green
    elif val < 4.0:
        return 'background-color: #FFB6C1; color: black; font-weight: bold'   # Light red
    else:
        return 'background-color: #F0E68C; color: black; font-weight: bold'

# Create styled table
quant_styled = (
    display_table
    .style
    .format({
        'CAGR': '{:.3f}',
        'Volatility': '{:.3f}',
        'Max Drawdown': '{:.3f}',
        'Sharpe': '{:.3f}',
        'Quant Score': '{:.2f}'
    })
    .map(quant_score_style, subset=['Quant Score'])
    .set_table_styles([
        {'selector': 'th', 
         'props': [('background-color', 'steelblue'), 
                   ('color', 'white'), 
                   ('font-weight', 'bold')]}
    ])
    .hide(axis='index')
    .set_caption("Quantitative Score Ranking (with interpretation)")
)

quant_styled
Table 1: Quantitative Score Ranking (with interpretation)
ticker CAGR Volatility Max Drawdown Sharpe Return_score Sharpe_score Vol_score DD_score Quant Score
NVDA 0.651 0.516 -0.663 1.222 1.000000 1.000000 0.198000 0.291000 7.69
WMT 0.218 0.208 -0.257 0.952 0.532000 0.859000 0.890000 0.878000 7.37
JPM 0.232 0.243 -0.388 0.875 0.548000 0.819000 0.812000 0.689000 6.96
GOOGL 0.293 0.308 -0.443 0.885 0.613000 0.824000 0.665000 0.609000 6.86
JNJ 0.121 0.168 -0.184 0.602 0.427000 0.676000 0.981000 0.984000 6.68
KO 0.107 0.159 -0.173 0.543 0.411000 0.646000 1.000000 1.000000 6.58
AAPL 0.169 0.278 -0.334 0.535 0.479000 0.641000 0.733000 0.767000 6.07
MSFT 0.144 0.260 -0.371 0.475 0.452000 0.610000 0.773000 0.713000 5.90
BAC 0.156 0.269 -0.466 0.504 0.465000 0.625000 0.752000 0.575000 5.82
V 0.096 0.227 -0.286 0.334 0.400000 0.537000 0.848000 0.836000 5.74
MA 0.103 0.242 -0.283 0.345 0.408000 0.542000 0.815000 0.841000 5.73
HD 0.100 0.236 -0.347 0.341 0.405000 0.540000 0.829000 0.748000 5.64
PG 0.049 0.173 -0.238 0.167 0.349000 0.449000 0.970000 0.906000 5.59
META 0.188 0.432 -0.767 0.389 0.500000 0.565000 0.386000 0.140000 4.61
AMZN 0.083 0.349 -0.561 0.182 0.386000 0.457000 0.575000 0.438000 4.50
NFLX 0.089 0.429 -0.759 0.160 0.392000 0.446000 0.393000 0.152000 3.84
UNH -0.024 0.307 -0.614 -0.144 0.270000 0.287000 0.667000 0.362000 3.64
TSLA 0.141 0.604 -0.736 0.200 0.448000 0.467000 0.000000 0.185000 3.38
DIS -0.070 0.298 -0.607 -0.301 0.221000 0.205000 0.689000 0.372000 3.25
ADBE -0.109 0.351 -0.605 -0.366 0.179000 0.171000 0.568000 0.375000 2.74
PYPL -0.274 0.423 -0.865 -0.695 0.000000 0.000000 0.407000 0.000000 0.81

Key takeaway:

NVDA leads due to its extraordinary growth, while WMT ranks second because of its outstanding balance between solid returns and very low risk.

Part 3. Machine Learning

3.1. Objective

The goal of this section is not to predict exact future prices.

Instead, I aim to:

• Forecast the future trend

• Estimate the expected return

• Predict direction (up/down)

• Convert predictions into an ML investment signal

This approach is more realistic and aligned with quantitative investing practice.

3.2. Feature Engineering

Feature engineering is the most important part of ML in finance. I construct predictive variables based on:

Momentum & Trend

  • Lag returns (1, 5, 10, 20 days)

  • RSI

  • MACD

  • ATR

Long-Term Structure

  • Price / SMA200 ratio

Liquidity

  • Volume change

I aim to predict future_return_20d, the 20-trading-day forward return (roughly one month ahead).

Code
df_ml = df_tech.copy()
df_ml = df_ml.sort_values(['ticker', 'date'])

# =============================
# Create Lag Returns
# =============================

for lag in [1, 5, 10, 20]:
    df_ml[f'return_{lag}d'] = (
        df_ml.groupby('ticker')['close']
        .pct_change(lag)
    )

# =============================
# Volume Change
# =============================

df_ml['volume_change_5d'] = (
    df_ml.groupby('ticker')['volume']
    .pct_change(5)
)

# =============================
# Price / SMA Ratio
# =============================

df_ml['price_sma200_ratio'] = df_ml['close'] / df_ml['SMA200']

# =============================
# Target Variable (20-day forward return)
# =============================

df_ml['future_return_20d'] = (
    df_ml.groupby('ticker')['close']
    .shift(-20) / df_ml['close'] - 1
)

# Drop NA
df_ml = df_ml.dropna()

df_ml.head()
id ticker date open high low close adj_close volume SMA50 ... MACD_signal MACD_hist ATR return_1d return_5d return_10d return_20d volume_change_5d price_sma200_ratio future_return_20d
199 200 AAPL 2021-09-22 141.2759 143.2124 140.5424 142.6451 142.6451 76404300 144.960510 ... -1.102816 0.348422 2.864782 0.016873 -0.021338 -0.059699 -0.025197 -0.082576 1.088618 0.023380
200 201 AAPL 2021-09-23 143.4275 143.8481 142.4397 143.6036 143.6036 64838200 144.919494 ... -0.887100 0.126647 2.760755 0.006719 -0.013173 -0.046992 -0.010313 -0.046975 1.094966 0.018048
201 202 AAPL 2021-09-24 142.4593 144.2295 142.3615 143.6916 143.6916 53477900 144.893324 ... -0.700928 -0.048585 2.696987 0.000613 0.005888 -0.013761 -0.004202 -0.588216 1.094693 0.012047
202 203 AAPL 2021-09-27 142.2735 142.7527 140.6597 142.1757 142.1757 74150700 144.877656 ... -0.643794 -0.209533 2.720909 -0.010550 0.017000 -0.027950 -0.021736 -0.399487 1.082171 0.022494
203 204 AAPL 2021-09-28 140.1022 141.5693 138.5765 138.7917 138.7917 108972300 144.871262 ... -0.788283 -0.406604 2.783644 -0.023802 -0.010597 -0.041925 -0.073211 0.436985 1.055657 0.052216

5 rows × 23 columns

3.3. Model Selection

I use an XGBoost regressor because it:

  • Handles nonlinear relationships

  • Is robust to multicollinearity

  • Performs well on tabular financial data

  • Is interpretable via feature importances

I use a chronological split: the model is trained on the first 80% of dates and tested on the remaining 20%, so it never sees the future during training.
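The single chronological split used below could be extended to a true walk-forward evaluation, for example with scikit-learn's TimeSeriesSplit. A minimal sketch on toy data (the real pipeline would split on the sorted date index instead):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy ordered observations standing in for trading days
X = np.arange(20).reshape(-1, 1)

# Each fold trains on an expanding window of the past and
# tests on the block immediately after it -- no look-ahead.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()  # test always after train
```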

Code
# ===============================
# Prepare Data (Sort First)
# ===============================

df_ml = df_ml.sort_values(['date', 'ticker']).reset_index(drop=True)

# Feature columns
features = [
    'return_1d', 'return_5d', 'return_10d', 'return_20d',
    'RSI', 'MACD', 'ATR',
    'price_sma200_ratio',
    'volume_change_5d'
]

target = 'future_return_20d'


# ===============================
# Time-Based Split
# ===============================

# Determine split date (80% time)
split_date = df_ml['date'].quantile(0.8)

train = df_ml[df_ml['date'] <= split_date].copy()
test  = df_ml[df_ml['date'] > split_date].copy()

print("Train period:", train['date'].min(), "→", train['date'].max())
print("Test period:", test['date'].min(), "→", test['date'].max())

print("Train tickers:", train['ticker'].nunique())
print("Test tickers:", test['ticker'].nunique())


# ===============================
# Train-Test Sets
# ===============================

X_train = train[features]
y_train = train[target]

X_test  = test[features]
y_test  = test[target]


# ===============================
# Model Training
# ===============================

model = XGBRegressor(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

model.fit(X_train, y_train)


# ===============================
# Prediction
# ===============================

y_pred = model.predict(X_test)
Train period: 2021-09-22 00:00:00 → 2025-02-26 00:00:00
Test period: 2025-02-27 00:00:00 → 2026-01-05 00:00:00
Train tickers: 21
Test tickers: 21

3.4. Model Evaluation

I evaluate the model using:

  • RMSE: measures the magnitude of prediction error, in units of 20-day return.

  • Directional Accuracy: the percentage of observations where the predicted sign (up/down) of the 20-day return matches the actual sign.

Code
# RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Directional Accuracy
direction_actual = np.sign(y_test)
direction_pred = np.sign(y_pred)

directional_accuracy = (direction_actual == direction_pred).mean()

print("RMSE:", round(rmse, 4))
print("Directional Accuracy:", round(directional_accuracy, 4))
RMSE: 0.0906
Directional Accuracy: 0.5316
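A directional accuracy of ~53% is best read against a naive baseline, such as always predicting the sample's majority direction. A minimal sketch with illustrative toy returns (not the actual y_test):

```python
import numpy as np

# Toy actual vs. predicted 20-day returns (illustrative values only)
actual = np.array([0.02, -0.01, 0.03, -0.02, 0.01])
pred   = np.array([0.01,  0.02, 0.01, -0.03, 0.02])

model_acc = (np.sign(actual) == np.sign(pred)).mean()      # 4 of 5 signs match
baseline  = max((actual > 0).mean(), (actual < 0).mean())  # always guess "up"
print(model_acc > baseline)  # the model adds value only if this holds
```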
Code
results_df = pd.DataFrame({
    'Actual': y_test.values,
    'Predicted': y_pred
})

fig = px.scatter(
    results_df,
    x='Actual',
    y='Predicted',
    title='Actual vs Predicted 20-Day Returns'
)

fig.add_shape(
    type='line',
    x0=results_df['Actual'].min(),
    y0=results_df['Actual'].min(),
    x1=results_df['Actual'].max(),
    y1=results_df['Actual'].max()
)

fig.update_layout(height=500)

fig.show()
Code
importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

fig_imp = px.bar(
    importance_df,
    x='Feature',
    y='Importance',
    title='Feature Importance'
)

fig_imp.update_layout(height=500)

fig_imp.show()

3.5. ML Signal Construction

After generating out-of-sample predictions from the trained XGBoost model, I construct the ML investment signal from the most recent predicted return for each ticker.

Specifically, for each stock, I extract the latest predicted 20-day forward return from the test period. This represents the model's expected medium-term performance based on current technical and structural features.

To convert predictions into an interpretable ranking metric, I rescale the predicted returns with Min-Max scaling to a standardized 0–10 ML Score:

  • Higher predicted return → Higher ML Score

  • Lower predicted return → Lower ML Score

This transformation ensures comparability across stocks and allows seamless integration into the final multi-factor scoring framework.

The resulting ML Score does not represent certainty of price movement, but rather the relative strength of expected forward return according to the model.
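Since the 0–10 scaling is a linear map, any row's ML Score can be verified by hand from the two extreme predicted returns. Checking GOOGL against the signal table below (TSLA holds the minimum, AAPL the maximum):

```python
# Verify GOOGL's ML Score from the reported predicted returns.
lo, hi = -0.028040, 0.047584   # TSLA (min) and AAPL (max) predicted_return
googl = 0.047144

# Same linear map that MinMaxScaler((0, 10)) applies
ml_score = 10 * (googl - lo) / (hi - lo)
print(round(ml_score, 2))  # 9.94, matching the table
```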

Code
# Attach predictions
test = test.copy()
test['predicted_return'] = y_pred

# Get latest prediction per ticker
ml_signal = (
    test
    .sort_values('date')
    .groupby('ticker')
    .tail(1)[['ticker', 'predicted_return']]
    .sort_values('predicted_return', ascending=False)
)


# Normalize predicted returns:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler((0, 10))

ml_signal['ML Score'] = scaler.fit_transform(
    ml_signal[['predicted_return']]
)

ml_signal = ml_signal.sort_values('ML Score', ascending=False)

ml_signal
ticker predicted_return ML Score
22575 AAPL 0.047584 10.000000
22580 GOOGL 0.047144 9.941805
22588 NFLX 0.039374 8.914421
22595 WMT 0.022049 6.623508
22578 BAC 0.019512 6.287998
22589 NVDA 0.018553 6.161199
22587 MSFT 0.018195 6.113890
22583 JPM 0.017146 5.975072
22581 HD 0.011848 5.274515
22593 UNH 0.011781 5.265734
22577 AMZN 0.010119 5.045922
22591 PYPL 0.007110 4.648088
22586 META 0.006526 4.570768
22582 JNJ 0.006073 4.510870
22579 DIS 0.004332 4.280670
22590 PG 0.003566 4.179348
22584 KO -0.008488 2.585410
22585 MA -0.017626 1.377099
22594 V -0.018471 1.265449
22576 ADBE -0.025800 0.296191
22592 TSLA -0.028040 0.000000